{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# 第20章 潜在狄利克雷分配"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "## 习题20.1\n",
    "&emsp;&emsp;推导狄利克雷分布数学期望公式。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**\n",
    "\n",
    "1. 给出狄利克雷分布\n",
    "2. 写出狄利克雷分布的数学期望计算公式\n",
    "2. 利用概率分布的归一化性质，推导规范化因子的形式\n",
    "3. 将规范化因子代入原式，分别计算多元随机变量每个分量的期望"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：狄利克雷分布**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20章的定义20.2的狄利克雷分布的定义：\n",
    "\n",
    "> &emsp;&emsp;**定义20.2（狄利克雷分布）** 若多元连续随机变量$\\theta=(\\theta_1, \\theta_2, \\cdots, \\theta_k)$的概率密度函数为\n",
    "> $$\n",
    "p(\\theta|\\alpha) = \\frac{\\Gamma \\left(\\displaystyle \\sum_{i=1}^k \\alpha_i \\right)}{\\displaystyle \\prod_{i=1}^k \\Gamma(\\alpha_i)} \\prod_{i=1}^k \\theta_i^{\\alpha_i - 1} \\tag{20.2}\n",
    "$$\n",
    "> 其中$\\displaystyle \\sum_{i=1}^k \\theta_i = 1, \\ \\theta_i \\geqslant 0, \\ \\alpha=(\\alpha_1, \\alpha_2, \\cdots, \\alpha_k), \\ \\alpha_i > 0, \\ i=1,2,\\cdots,k$，则称随机变量$\\theta$服从参数为$\\alpha$的狄利克雷分布，记作$\\theta \\sim \\text{Dir}(\\alpha)$。\n",
    "> \n",
    "> &emsp;&emsp;令\n",
    "> $$\n",
    "\\text{B}(\\alpha) = \\frac{\\displaystyle \\prod_{i=1}^k \\Gamma(\\alpha_i)}{\\Gamma \\left(\\displaystyle \\sum_{i=1}^k \\alpha_i \\right)} \\tag{20.3}\n",
    "$$\n",
    "> 则狄利克雷分布的密度函数可以写成\n",
    "> $$\n",
    "p(\\theta | \\alpha) = \\frac{1}{\\text{B}(\\alpha)} \\prod_{i=1}^k \\theta_i^{\\alpha_i - 1} \\tag{20.4}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：写出狄利克雷分布的数学期望计算公式**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;已知概率分布$p(x)$数学期望的计算公式，即函数$x$在概率分布$p(x)$下的期望：$\\displaystyle E \\left[ p(x) \\right] = \\int x \\cdot p(x) \\text{d}x$。\n",
    "\n",
    "&emsp;&emsp;可得，狄利克雷分布的数学期望 \n",
    "$$\n",
    "E[p(\\theta | \\alpha)] = \\int \\theta \\cdot p(\\theta|\\alpha) \\text{d}\\theta = \\int \\theta \\cdot \\frac{1}{\\text{B}(\\alpha)} \\prod_{i=1}^k \\theta_i^{\\alpha_i-1} \\text{d}\\theta\n",
    "$$\n",
    "其中，多元连续随机变量$\\theta = (\\theta_1,\\theta_2,\\dots,\\theta_k), \\alpha = (\\alpha_1, \\alpha_2,\\dots, \\alpha_k)$  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;分别计算 $\\theta$ 在每个分量上的期望，即\n",
    "$$\n",
    "E[p(\\theta | \\alpha)] = (E[p(\\theta_1 | \\alpha)], E[p(\\theta_2 | \\alpha)],\\cdots, E[p(\\theta_k | \\alpha)])\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：利用概率分布的归一化性质，推导规范化因子的形式**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;对于每个分量$\\theta_i$的期望：\n",
    "$$\n",
    "\\begin{aligned}\n",
    "E[p(\\theta_i | \\alpha)] \n",
    "&= \\int \\theta_i \\cdot \\frac{1}{\\text{B}(\\alpha)} \\prod_{j=1}^k \\theta_j^{\\alpha_j-1}\\text{d}\\theta_i \\\\\n",
    "&= \\frac{1}{\\text{B}(\\alpha)} \\int \\theta_1^{\\alpha_1-1} \\theta_2^{\\alpha_2-1} \\cdots \\theta_i^{\\alpha_i} \\cdots \\theta_k^{\\alpha_k-1} \\text{d}\\theta_i\n",
    "\\end{aligned} \\tag{1}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;记上式右边的积分部可对应如下狄利克雷分布的密度函数\n",
    "$$\n",
    "\\begin{aligned}\n",
    "p(\\theta_i|\\alpha') \n",
    "&= \\frac{1}{\\text{B}(\\alpha')}\\theta_1^{\\alpha_1-1} \\theta_2^{\\alpha_2-1}\\dots \\theta_i^{\\alpha_i}\\dots \\theta_k^{\\alpha_k-1} \\\\\n",
    "&= \\frac{\\Gamma (\\alpha_1+\\alpha_2+\\dots+\\alpha_i+1+\\dots+\\alpha_k)}{\\Gamma(\\alpha_1)\\Gamma(\\alpha_2)\\dots\\Gamma(\\alpha_i+1)\\dots\\Gamma(\\alpha_k)}\\theta_1^{\\alpha_1-1} \\theta_2^{\\alpha_2-1}\\dots \\theta_i^{\\alpha_i}\\dots \\theta_k^{\\alpha_k-1}\n",
    "\\end{aligned}\n",
    "$$其中$\\alpha' = (\\alpha_1, \\alpha_2, \\cdots, \\alpha_i+1, \\cdots, \\alpha_k)$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据归一化条件：\n",
    "$$\n",
    "\\int p(\\theta_i|\\alpha') \\text{d}\\theta = 1\n",
    "$$\n",
    "&emsp;&emsp;可得：\n",
    "$$\n",
    "\\text{B}(\\alpha') = \\int \\theta_1^{\\alpha_1-1} \\theta_2^{\\alpha_2-1} \\cdots \\theta_i^{\\alpha_i} \\cdots \\theta_k^{\\alpha_k-1} \\text{d}\\theta \\tag{2}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：将规范化因子代入原式，分别计算多元随机变量每个分量的期望。**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;将公式（2）代入公式（1），可得分量$\\theta_i$的期望：\n",
    "$$\n",
    "\\begin{aligned}\n",
    "E[p(\\theta_i | \\alpha)] \n",
    "&= \\frac{\\text{B}(\\alpha')}{\\text{B}(\\alpha)} \\\\\n",
    "&= \\frac{\\Gamma(\\alpha_1) \\Gamma(\\alpha_2) \\cdots \\Gamma(\\alpha_i+1) \\cdots \\Gamma(\\alpha_k)}{\\Gamma (\\alpha_1+\\alpha_2+\\cdots+\\alpha_i+1+\\cdots+\\alpha_k)} \\cdot \\frac{\\Gamma (\\alpha_1+\\alpha_2+\\cdots+\\alpha_i+\\cdots+\\alpha_k)}{\\Gamma(\\alpha_1) \\Gamma(\\alpha_2) \\cdots \\Gamma(\\alpha_i) \\cdots \\Gamma(\\alpha_k)} \\\\\n",
    "&= \\frac{\\displaystyle \\alpha_i \\prod_{j=1}^k \\Gamma(\\alpha_j)}{\\displaystyle \\sum_{j=1}^k \\alpha_j \\Gamma \\left( \\sum_{j=1}^k \\alpha_j \\right)} \\cdot \\frac{\\displaystyle \\Gamma \\left( \\sum_{j=1}^k \\alpha_j \\right)}{\\displaystyle \\prod_{j=1}^k \\Gamma(\\alpha_j)} \\\\\n",
    "&= \\frac{\\alpha_i}{\\displaystyle \\sum_{j=1}^k \\alpha_j }\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;因此，狄利克雷分布的数学期望：\n",
    "$$\n",
    "\\begin{aligned}\n",
    "E[p(\\theta | \\alpha)] \n",
    "&= ( E[p(\\theta_1 | \\alpha)], E[p(\\theta_2 | \\alpha)], \\cdots, E[p(\\theta_k | \\alpha)] ) \\\\\n",
    "&= \\left( \\frac{\\alpha_1}{\\displaystyle \\sum_{i=1}^k \\alpha_i}, \\frac{\\alpha_2}{\\displaystyle \\sum_{i=1}^k \\alpha_i} ,\\cdots, \\frac{\\alpha_k}{\\displaystyle \\sum_{i=1}^k \\alpha_i} \\right)\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题20.2\n",
    "&emsp;&emsp;针对17.2.2节的文本例子，使用LDA模型进行话题分析。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**  \n",
    "\n",
    "1. 给出LDA吉布斯抽样算法\n",
    "2. 自编程实现基于吉布斯抽样算法的LDA模型\n",
    "3. 针对17.2.2节的文本例子，使用LDA模型进行话题分析"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：LDA吉布斯抽样算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20章的算法20.2的LDA吉布斯抽样算法：\n",
    "\n",
    "> **算法20.2（LDA吉布斯抽样算法）**  \n",
    "> 输入：文本的单词序列$w = \\{ w_1, w_2, \\cdots, w_m, \\cdots, w_M \\}, \\ w_m = (w_{m1}, w_{m2}, \\cdots, w_{mn}, \\cdots, w_{m N_m})$；  \n",
    "> 输出：文本的话题序列$z = \\{ z_1, z_2, \\cdots, z_m, \\cdots, z_M \\}, \\ z_m = (z_{m1}, z_{m2}, \\cdots, z_{mn}, \\cdots, z_{m N_m})$的后验概率分布$p(z | w, \\alpha, \\beta)$的样本计数，模型的参数$\\varphi$和$\\theta$的估计值；  \n",
    "> 参数：超参数$\\alpha$和$\\beta$，话题个数$K$。  \n",
    "> （1）设所有计数矩阵的元素$n_{mk}, n_{kv}$，计数向量的元素$n_m, n_k$初值为0；  \n",
    "> （2）对所有文本$w_m, \\ m=1,2,\\cdots, M$，对第$m$个文本中的所有单词$w_{mn},\\ n=1,2,\\cdots, N_m$，抽样话题$z_{mn} = z_k \\sim \\text{Mult}\\left(\\displaystyle \\frac{1}{K} \\right)$；增加文本-话题计数$n_{mk} = n_{mk} + 1$，增加文本-话题和计数$n_m = n_m + 1$，增加话题-单词计数$n_{kv} = n_{kv} + 1$，增加话题-单词和计数$n_k = n_k + 1$；  \n",
    "> （3）循环执行以下操作，直到进入燃烧期。对所有文本$w_m, m = 1, 2, \\cdots, M$，对第$m$个文本中的所有单词$w_{mn}, n = 1, 2, \\cdots, N_m$；  \n",
    "> &emsp;&emsp;&nbsp;&nbsp;（a）当前的单词$w_{mn}$是第$v$个单词，话题指派$z_{mn}$是第$k$个话题；减少计数$n_{mk} = n_{mk} - 1, n_m = n_m - 1, n_{kv} = n_{kv} -1, n_k = n_k - 1$；  \n",
    "> &emsp;&emsp;&nbsp;&nbsp;（b）按照满条件分布进行抽样\n",
    "> $$\n",
    "p(z_i | \\textbf{z}_{-i}, \\textbf{w}, \\alpha, \\beta) \\propto \\frac{n_{kv} + \\beta_v}{\\displaystyle \\sum_{v=1}^V (n_kv + \\beta_v)} \\cdot \\frac{n_{mk} + \\alpha_k}{\\displaystyle \\sum_{k=1}^K (n_{mk} + \\alpha_k)} \n",
    "$$\n",
    "> 得到新的第$k'$个话题，分配给$z_{mn}$；  \n",
    "> &emsp;&emsp;&nbsp;&nbsp;（c）增加计数$n_{mk'} = n_{mk'}  + 1, n_m = n_m + 1, n_{k'v} = n_{k'v} + 1, n_{k'} = n_{k'} + 1$；  \n",
    "> &emsp;&emsp;&nbsp;&nbsp;（d）得到更新的两个计数矩阵$N_{K \\times V} = [n_{kv}]$和$N_{M \\times K} = [n_{mk}]$，表示后验概率分布$p(\\textbf{z} | \\textbf{w}, \\alpha, \\beta)$的样本计数；  \n",
    "> （4）利用得到的样本计数，计算模型参数\n",
    "> $$\n",
    "\\theta_{mk} = \\frac{n_{mk} + \\alpha_k}{\\displaystyle \\sum_{k=1}^K (n_{mk} + \\alpha_k)} \\\\\n",
    "\\varphi_{kv} = \\frac{n_{kv} + \\beta_v}{\\displaystyle \\sum_{v=1}^V (n_{kv} + \\beta_v)}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：自编程实现基于吉布斯抽样算法的LDA模型**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "\n",
    "class GibbsSamplingLDA:\n",
    "    def __init__(self, iter_max=1000):\n",
    "        self.iter_max = iter_max\n",
    "        self.weights_ = []\n",
    "\n",
    "    def fit(self, words, K):\n",
    "        \"\"\"\n",
    "        :param words: 单词-文本矩阵\n",
    "        :param K: 话题个数\n",
    "        :return: 文本话题序列z\n",
    "        \"\"\"\n",
    "        # M, Nm分别为文本个数和单词个数\n",
    "        words = words.T\n",
    "        M, Nm = words.shape\n",
    "\n",
    "        # 初始化超参数alpha, beta，其中alpha为文本的话题分布相关参数\n",
    "        # beta为话题的单词分布相关参数\n",
    "        alpha = np.array([1 / K] * K)\n",
    "        beta = np.array([1 / Nm] * Nm)\n",
    "\n",
    "        # 初始化参数theta, varphi，其中theta为文本关于话题的多项分布参数，\n",
    "        # varphi为话题关于单词的多项分布参数\n",
    "        theta = np.zeros([M, K])\n",
    "        varphi = np.zeros([K, Nm])\n",
    "\n",
    "        # 输出文本的话题序列z\n",
    "        z = np.zeros(words.shape, dtype='int')\n",
    "\n",
    "        # (1)设所有计数矩阵的元素n_mk、n_kv，计数向量的元素n_m、n_k初值为 0\n",
    "        n_mk = np.zeros([M, K])\n",
    "        n_kv = np.zeros([K, Nm])\n",
    "        n_m = np.zeros(M)\n",
    "        n_k = np.zeros(K)\n",
    "\n",
    "        # (2)对所有M个文本中的所有单词进行循环\n",
    "        for m in range(M):\n",
    "            for v in range(Nm):\n",
    "                # 如果单词v存在于文本m\n",
    "                if words[m, v] != 0:\n",
    "                    # (2.a)抽样话题\n",
    "                    z[m, v] = np.random.choice(list(range(K)))\n",
    "                    # 增加文本-话题计数\n",
    "                    n_mk[m, z[m, v]] += 1\n",
    "                    # 增加文本-话题和计数\n",
    "                    n_m[m] += 1\n",
    "                    # 增加话题-单词计数\n",
    "                    n_kv[z[m, v], v] += 1\n",
    "                    # 增加话题-单词和计数\n",
    "                    n_k[z[m, v]] += 1\n",
    "\n",
    "        # (3)对所有M个文本中的所有单词进行循环，直到进入燃烧期\n",
    "        zi = 0\n",
    "        for i in range(self.iter_max):\n",
    "            for m in range(M):\n",
    "                for v in range(Nm):\n",
    "                    # (3.a)如果单词v存在于文本m，那么当前单词是第v个单词，\n",
    "                    # 话题指派z_mv是第k个话题\n",
    "                    if words[m, v] != 0:\n",
    "                        # 减少计数\n",
    "                        n_mk[m, z[m, v]] -= 1\n",
    "                        n_m[m] -= 1\n",
    "                        n_kv[z[m, v], v] -= 1\n",
    "                        n_k[z[m, v]] -= 1\n",
    "\n",
    "                        # (3.b)按照满条件分布进行抽样\n",
    "                        max_zi_value, max_zi_index = -float('inf'), z[m, v]\n",
    "                        for k in range(K):\n",
    "                            zi = ((n_kv[k, v] + beta[v]) / (n_kv[k, :].sum() + beta.sum())) * \\\n",
    "                                 ((n_mk[m, k] + alpha[k]) / (n_mk[m, :].sum() + alpha.sum()))\n",
    "\n",
    "                        # 得到新的第 k‘个话题，分配给 z_mv\n",
    "                        if max_zi_value < zi:\n",
    "                            max_zi_value, max_zi_index = zi, k\n",
    "                            z[m, v] = max_zi_index\n",
    "\n",
    "                        # (3.c) (3.d)增加计数并得到两个更新的计数矩阵的n_kv和n_mk\n",
    "                        n_mk[m, z[m, v]] += 1\n",
    "                        n_m[m] += 1\n",
    "                        n_kv[z[m, v], v] += 1\n",
    "                        n_k[z[m, v]] += 1\n",
    "\n",
    "        # (4)利用得到的样本计数，计算模型参数\n",
    "        for m in range(M):\n",
    "            for k in range(K):\n",
    "                theta[m, k] = (n_mk[m, k] + alpha[k]) / (n_mk[m, :].sum() + alpha.sum())\n",
    "\n",
    "        for k in range(K):\n",
    "            for v in range(Nm):\n",
    "                varphi[k, v] = (n_kv[k, v] + beta[v]) / (n_kv[k, :].sum() + beta.sum())\n",
    "\n",
    "        self.weights_ = [varphi, theta]\n",
    "        return z.T, n_kv, n_mk"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：针对17.2.2节的文本例子，使用LDA模型进行话题分析**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;17.2.2节的文本例子中的数据如下：\n",
    "<center>\n",
    "<img src=\"./images/18-3-Text-Dataset.png\" style=\"zoom: 55%;\">\n",
    "<br/>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;使用LDA模型进行话题分析： "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "文本的话题序列z：\n",
      "[[0 0 2 2 0 0 0 0 0]\n",
      " [0 0 0 0 0 2 0 0 2]\n",
      " [0 2 0 0 0 0 0 2 0]\n",
      " [0 0 0 0 0 0 2 0 2]\n",
      " [2 0 0 0 0 2 0 0 0]\n",
      " [2 2 2 2 2 2 2 2 2]\n",
      " [2 0 2 0 0 0 0 0 0]\n",
      " [0 0 0 0 0 0 2 0 2]\n",
      " [0 0 0 0 0 2 0 0 2]\n",
      " [2 0 2 0 0 0 0 2 0]\n",
      " [0 0 0 2 2 0 0 0 0]]\n",
      "样本的计数矩阵N_KV：\n",
      "[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
      " [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]\n",
      " [2. 2. 2. 2. 2. 9. 2. 2. 2. 3. 2.]]\n",
      "样本的计数矩阵N_MK：\n",
      "[[0. 0. 4.]\n",
      " [0. 0. 2.]\n",
      " [0. 0. 4.]\n",
      " [0. 0. 3.]\n",
      " [0. 0. 2.]\n",
      " [0. 0. 4.]\n",
      " [0. 0. 3.]\n",
      " [0. 0. 3.]\n",
      " [0. 0. 5.]]\n",
      "模型参数varphi：\n",
      "[[0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091]\n",
      " [0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091 0.091]\n",
      " [0.067 0.067 0.067 0.067 0.067 0.293 0.067 0.067 0.067 0.1   0.067]]\n",
      "模型参数theta：\n",
      "[[0.067 0.067 0.867]\n",
      " [0.111 0.111 0.778]\n",
      " [0.067 0.067 0.867]\n",
      " [0.083 0.083 0.833]\n",
      " [0.111 0.111 0.778]\n",
      " [0.067 0.067 0.867]\n",
      " [0.083 0.083 0.833]\n",
      " [0.083 0.083 0.833]\n",
      " [0.056 0.056 0.889]]\n"
     ]
    }
   ],
   "source": [
    "gibbs_sampling_lda = GibbsSamplingLDA(iter_max=1000)\n",
    "\n",
    "# 输入文本-单词矩阵，共有9个文本，11个单词\n",
    "words = np.array([[0, 0, 1, 1, 0, 0, 0, 0, 0],\n",
    "                  [0, 0, 0, 0, 0, 1, 0, 0, 1],\n",
    "                  [0, 1, 0, 0, 0, 0, 0, 1, 0],\n",
    "                  [0, 0, 0, 0, 0, 0, 1, 0, 1],\n",
    "                  [1, 0, 0, 0, 0, 1, 0, 0, 0],\n",
    "                  [1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
    "                  [1, 0, 1, 0, 0, 0, 0, 0, 0],\n",
    "                  [0, 0, 0, 0, 0, 0, 1, 0, 1],\n",
    "                  [0, 0, 0, 0, 0, 2, 0, 0, 1],\n",
    "                  [1, 0, 1, 0, 0, 0, 0, 1, 0],\n",
    "                  [0, 0, 0, 1, 1, 0, 0, 0, 0]])\n",
    "\n",
    "K = 3  # 假设话题数量为3\n",
    "\n",
    "# 设置精度为3\n",
    "np.set_printoptions(precision=3, suppress=True)\n",
    "\n",
    "z, n_kv, n_mk = gibbs_sampling_lda.fit(words, K)\n",
    "varphi = gibbs_sampling_lda.weights_[0]\n",
    "theta = gibbs_sampling_lda.weights_[1]\n",
    "\n",
    "print(\"文本的话题序列z：\")\n",
    "print(z)\n",
    "print(\"样本的计数矩阵N_KV：\")\n",
    "print(n_kv)\n",
    "print(\"样本的计数矩阵N_MK：\")\n",
    "print(n_mk)\n",
    "print(\"模型参数varphi：\")\n",
    "print(varphi)\n",
    "print(\"模型参数theta：\")\n",
    "print(theta)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题20.3\n",
    "&emsp;&emsp;找出LDA的吉布斯抽样算法、变分EM算法中利用狄利克雷分布的部分，思考LDA中使用狄利克雷分布的重要性。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**  \n",
    "\n",
    "1. 写出LDA的吉布斯抽样算法中利用狄利克雷分布的部分\n",
    "2. 写出变分EM算法中利用狄利克雷分布的部分\n",
    "3. 写出LDA中使用狄利克雷分布的重要性"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;在话题模型中，假设话题是单词的多项分布，文本是话题的多项分布，而狄利克雷分布是多项分布的先验，在进行贝叶斯学习时，需要使用狄利克雷分布作为先验分布，因此要先找出LDA的吉布斯抽样算法和变分EM算法中的多项分布，再找出对应的狄利克雷分布。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：LDA的吉布斯抽样算法**   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20.3节的LDA模型基本想法：\n",
    "\n",
    "> &emsp;&emsp;LDA模型的学习通常采用收缩的吉布斯抽样方法，基本想法是，通过对隐变量$\\theta$和$\\varphi$积分，得到边缘概率分布$p(w,z | \\alpha, \\beta)$（也是联合分布），其中变量$w$是可观测的，变量$z$是不可观测的；对后验概率分布$p(z | w, \\alpha, \\beta)$进行吉布斯抽样，得到分布$p(z | w, \\alpha, \\beta)$的样本集合；再利用这个样本集合对参数$\\theta$和$\\varphi$进行估计，最终得到LDA模型$p(z | w, \\alpha, \\beta)$的所有参数估计。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20.3.2节的抽样分布的表达式：\n",
    "\n",
    "> $p(w, z| \\alpha, \\beta) = p(w | z, \\alpha, \\beta) p(z | \\alpha, \\beta) = p(w | z, \\beta) p(z | \\alpha) \\tag{20.21}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 处理第一个因子$p(w| z, \\beta)$\n",
    "\n",
    "&emsp;&emsp;该分布表示在给定话题$z$和话题-单词分布参数$\\varphi$下的文本的分布，是一个多项分布。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20.3.2节的公式(20.22)：\n",
    "\n",
    "> $$\n",
    "p(w|z, \\varphi) = \\prod_{k=1}^K \\prod_{v=1}^V \\varphi_{kv}^{n_{kv}}\n",
    "$$\n",
    "> 其中$\\varphi_{kv}$表示第$k$个话题生成单词集合第$v$个单词的概率，$n_{kv}$是数据中第$k$个话题生成第$v$个单词的次数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;现引入该多项分布的共轭先验，有\n",
    "$$\n",
    "p(w, \\varphi|z, \\beta) \\propto p(w|z, \\varphi)p(\\varphi|\\beta)\n",
    "$$\n",
    "&emsp;&emsp;对上式两边求$\\varphi$的积分得：\n",
    "$$\n",
    "\\begin{aligned}\n",
    "p(w | z, \\beta) \n",
    "&= \\int p(w, \\varphi|z, \\beta) \\text{d} \\varphi \\\\\n",
    "&= \\int p(w|z, \\varphi)p(\\varphi|\\beta) \\text{d} \\varphi \\\\\n",
    "&= \\prod_{k=1}^K \\frac{\\text{B}(n_k + \\beta)}{\\text{B}(\\beta)}\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据假设 $p(w|z, \\beta), p(\\varphi|\\beta)$ 均是多项分布，其中第$k$个话题的单词分布参数$\\varphi_k$的先验分布为\n",
    "$$\n",
    "p(\\varphi_k|\\beta) = \\text{Dir}(\\varphi_k|\\beta)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20.3.3节的公式（20.32）：\n",
    "\n",
    "> 2. 参数$\\varphi = \\{ \\varphi_k \\}$的估计  \n",
    "> 后验概率满足  \n",
    "> $$\n",
    "p(\\varphi_k|w, z, \\beta) = \\text{Dir} (\\varphi_k | n_k + \\beta)\n",
    "$$\n",
    "> 这里$n_k = \\{ n_{k1}, n_{k2}, \\dots, n_{kV}\\}$是第$k$个话题的单词的计数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. 处理第二个因子 $p(z | \\alpha)$\n",
    "\n",
    "&emsp;&emsp;该分布表示在给定文本的话题分布参数$\\theta$下话题的分布，是一个多项分布。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20.3.2节的公式(20.24)：\n",
    "> $$\n",
    "p(z | \\theta) = \\prod_{m=1}^M \\prod_{k=1}^K \\theta_{mk}^{n_{mk}}\n",
    "$$\n",
    "> 其中$\\theta_{mk}$表示第$m$个文本生成第$k$个话题的概率，$n_{mk}$是数据中第$m$个文本生成第$k$个话题的次数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;现引入该多项分布的共轭先验，有\n",
    "$$\n",
    "p(z, \\theta | \\alpha) \\propto p(z | \\theta) p(\\theta | \\alpha)\n",
    "$$\n",
    "&emsp;&emsp;对上式两边求$\\theta$积分得：\n",
    "$$\n",
    "\\begin{aligned}\n",
    "p(z | \\alpha) \n",
    "&= \\int p(z, \\theta | \\alpha) \\text{d} \\theta \\\\\n",
    "&= \\int p(z | \\theta) p(\\theta | \\alpha) \\text{d} \\theta \\\\\n",
    "&= \\prod_{m=1}^M \\frac{\\text{B}(n_m + \\alpha)}{\\text{B}(\\alpha)}\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据假设 $p(z, \\theta|\\alpha), p(\\theta|\\alpha)$均是多项分布，其中第$m$个文本的话题分布参数$\\theta_m$的先验分布为\n",
    "$$\n",
    "p(\\theta_m|\\alpha) = \\text{Dir} (\\theta_m|\\alpha)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20.3.3节的公式（20.30）：\n",
    "\n",
    "> 1. 参数$\\theta = \\{ \\theta_m \\}$的估计  \n",
    "> 后验概率满足  \n",
    "> $$\n",
    "p(\\theta_m|z_m, \\alpha) = \\text{Dir} (\\theta_m | n_m + \\alpha) \\tag{20.30}\n",
    "$$\n",
    "> 这里$n_m = \\{ n_{m1}, n_{m2}, \\cdots, n_{mK} \\}$是第$m$个文本的话题的计数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：LDA的变分EM算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20.4.3节的变分EM算法：\n",
    "\n",
    "> &emsp;&emsp;将变分EM算法应用到LDA模型的学习上，首先定义具体的变分分布，推导证据下界的表达式，接着推导变分分布的参数和LDA模型的参数估计，最后给出LDA模型的变分EM算法。\n",
    "> 为简单起见，一次只考虑一个文本，记作$w$。文本的单词序列$w = (w_1, \\cdots, w_n, \\cdots, w_N)$，对应的话题序列$z = (z_1, \\cdots, z_n, \\cdots, z_N)$，以及话题分布$\\theta$，随机变量$w$，$z$和$\\theta$的联合分布是\n",
    "> $$\n",
    "p(\\theta, z, w | \\alpha, \\varphi) = p(\\theta | \\alpha) \\prod_{n=1}^N p(z_n | \\theta) p(w_n | z_n, \\varphi) \\tag{20.42}\n",
    "$$\n",
    "> 其中$w$是可观测变量，$\\theta$和$z$是隐变量，$\\alpha$和$\\varphi$是参数。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据第1步已证明$p(\\theta | \\alpha)$是狄利克雷分布，即\n",
    "$$\n",
    "p(\\theta|\\alpha) = \\text{Dir} (\\theta | \\alpha)\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;另一部分$\\displaystyle \\prod_{n=1}^N p(z_n | \\theta) p(w_n | z_n, \\varphi)$表示给定话题分布$\\theta$产生第$n$个文本的话题序列 $p(z_n|\\theta)$的概率以及给定话题单词分布参数$\\varphi$和第$n$个文本的话题序列$z_n$下产生第$n$个文本$w_n$的概率，根据第1步已证明均为多项分布。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20.4.3节的公式(20.43)：\n",
    "> $$\n",
    "q(\\theta, \\textbf{z} | \\gamma, \\eta) = q(\\theta | \\gamma) \\prod_{n=1}^N q(z_n | \\eta_n) \\tag{20.43}\n",
    "$$\n",
    "> 其中$\\gamma$是狄利克雷分布参数，$\\eta=(\\eta_1, \\eta_2, \\cdots, \\eta_n)$是多项分布参数，变量$\\theta$和$\\textbf{z}$的各个分量都是条件独立的。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;由于 $w$是可观测变量，变分EM算法就是利用上式的变分分布$q(\\theta, z | \\gamma, \\eta)$来近似后验分布$p(\\theta, z | w, \\alpha, \\varphi)$。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20.4.3节的公式(20.44)：\n",
    "\n",
    "> &emsp;&emsp;由此得到一个文本的证据下界\n",
    "> $$\n",
    "L(\\gamma, \\eta, \\alpha, \\varphi) = E_q[\\log p(\\theta, z, w | \\alpha, \\varphi)] - E_q[\\log q(\\theta, z | \\gamma, \\eta)]\n",
    "$$\n",
    "> 展开证据下界式\n",
    "> $$\n",
    "\\begin{aligned}\n",
    "L(\\gamma, \\eta, \\alpha, \\varphi) \n",
    "&= E_q[\\log p(\\theta|\\alpha)] + E_q[\\log p(z | \\theta)] \\\\ \n",
    "& \\quad + E_q[\\log p(w | z, \\varphi)] - E_q[\\log q(\\theta | \\gamma)] - E_q[\\log q( z | \\eta)]\n",
    "\\end{aligned}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;每一项都是对应的概率分布关于变分分布$q(\\theta, z | \\gamma, \\eta)$的期望，逐一分析：\n",
    "1. 第一项包含$p(\\theta|\\alpha)$，含有狄利克雷分布；\n",
    "2. 第二项$p(z | \\theta)$为多项分布，其中参数$\\theta$在变分分布中已假设为狄利克雷分布$q(\\theta|\\gamma)=\\mathrm{Dir}(\\theta|\\gamma)$，含有狄利克雷分布；\n",
    "3. 第三项$p(w | z, \\varphi)$为多项分布，其中参数$\\theta$的分布可以根据归一化性质消掉，剩余部分均为多项分布，因此不含狄利克雷分布；\n",
    "4. 第四项$q(\\theta|\\gamma)$ 已假设为狄利克雷分布；\n",
    "5. 第五项$q(z | \\eta)$，其中参数$\\theta$ 的分布根据归一化性质消掉，剩余均为多项分布，故不含狄利克雷分布。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：LDA中使用狄利克雷分布的重要性**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;由于LDA话题模型天然具备多项分布的性质，因此在进行贝叶斯学习时，则不可避免地需要引入狄利克雷分布作为多项分布的共轭先验，没有狄利克雷分布就无法进行话题模型的贝叶斯估计。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题20.4\n",
    "&emsp;&emsp;给出LDA的吉布斯抽样算法和变分EM算法的算法复杂度。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**  \n",
    "\n",
    "1. 计算LDA的吉布斯抽样算法的算法复杂度\n",
    "\n",
    "2. 计算LDA的变分EM算法的算法复杂度：根据书上409页算法20.4和411页算法20.5，同样计算循环次数以及每次循环的代价，注意这里需要分别考虑 E 步和 M 步的代价，最后加起来即为 LDA 变分EM算法的复杂度。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：LDA的吉布斯抽样算法的算法复杂度**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20章的算法20.2（LDA吉布斯抽样算法），假设迭代次数为$T$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. 第1、2步均为初始化相关，对于所有文本 $\\textbf{w}_m$，需要进行$M$次循环，即文本数量；对于第$m$个文本中的所有单词 $w_{mn}$，需要进行$N_m$次循环，即第$m$个文本包含的单词数，因此一共进行$T \\cdot M \\cdot N_m$次循环，故复杂度为$T \\cdot M \\cdot N_m$\n",
    "\n",
    "2. 第3步为迭代，每次循环过程中，第3.a、3.c和3.d步均为固定更新，算法复杂度均为1\n",
    "\n",
    "3. 第3.b步涉及到随机采样，该公式第一个因子表示话题生成该位置单词的概率，计算复杂度为$V$，即词典大小；第二个因子表示该位置的文本生成话题的概率，计算复杂度为$K$，即话题数量；由于词典大小远大于话题数量，这里可以近似认为计算复杂度为$V$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;因此，总算法复杂度为$T \\cdot M \\cdot N_m \\cdot V$。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：LDA的变分EM算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20章的算法20.5（LDA的变分EM算法），假设迭代次数为$T$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. E步：估计变分参数$\\gamma, \\eta$，假设重复 $N_E$ 次后收敛，根据书中第409页算法20.4（LDA的变分参数估计算法）中的第4、5步的循环次数为 $N_E \\cdot N \\cdot K$次；第6步在计算$\\displaystyle \\Psi ( \\sum_{l=1}^K \\gamma_l^{(t)} )$时的算法复杂度为$K$，其余计算的复杂度均为1  \n",
    "则每次总算法复杂度为$N_E \\cdot N \\cdot K \\cdot K$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. M步：估计模型参数 $\\alpha, \\varphi$，根据书中公式(20.63)\n",
    "$$\n",
    "L[\\varphi_w] = \\sum_{m=1}^M \\sum_{n=1}^{N_m} \\sum_{k=1}^K \\sum_{v=1}^V \\eta_{mnk} w_{mn}^v \\log \\varphi_{kv} + \\sum_{k=1}^K \\lambda_k \\left( \\sum_{v=1}^V \\varphi_{kv} - 1 \\right) \n",
    "$$\n",
    "对$\\varphi_{kv}$求偏导并令其为零，归一化求解，得到参数$\\varphi_{kv}$的估计值\n",
    "$$\n",
    "\\varphi_{kv} \\propto \\sum_{m=1}^M \\sum_{n=1}^{N_m} \\eta_{mnk} w_{mm}^v\n",
    "$$\n",
    "其算法复杂度为$M \\cdot N_m$；  \n",
    "再通过证据下界的最大化估计参数 $\\alpha$，根据公式(20.65)\n",
    "$$\n",
    "L_{|\\alpha|} = \\sum_{m=1}^M \\left \\{ \\log \\Gamma \\left( \\sum_{l=1}^K \\alpha_l \\right) - \\sum_{k=1}^K \\log \\Gamma(\\alpha_k) + \\sum_{k=1}^K (\\alpha_k - 1) \\left[ \\Psi(\\gamma_{mk}) - \\Psi \\left( \\sum_{l=1}^K \\gamma_{ml} \\right) \\right] \\right \\}\n",
    "$$\n",
    "计算其关于$\\alpha$的Hessian矩阵，然后应用牛顿法求该函数的最大化，假设牛顿法迭代次数为$N_M$，$\\alpha$包含$K$个话题的狄利克雷分布参数，牛顿法的计算复杂度为$N_M \\cdot K \\cdot K$  \n",
    "则每次总算法复杂度为$M \\cdot N_m + N_M \\cdot K \\cdot K$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;最后一共进行了 $T$ 次 EM 算法的迭代，因此，总算法复杂度为$T \\cdot (N_E \\cdot N \\cdot K \\cdot K + M \\cdot N_m + N_M \\cdot K \\cdot K)$。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 习题20.5\n",
    "&emsp;&emsp;证明变分EM算法收敛。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答思路：**  \n",
    "\n",
    "1. 给出变分EM算法\n",
    "2. 给出EM算法的收敛性\n",
    "3. 证明在变分EM算法中，证据下界随着迭代进行单调递增"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**解答步骤：**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第1步：变分EM算法**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第20章的算法20.3：\n",
    "\n",
    "> &emsp;&emsp;假设模型式联合概率分布$p(x,z | \\theta)$，其中$x$是观测变量，$z$是隐变量，$\\theta$是参数。目标是通过观测数据的概率（证据）$\\log p(x|\\theta)$的最大化，估计模型的参数$\\theta$。使用变分推理，导入平均场$q(z)=\\prod_{i=1}^n q(z_i)$，定义证据下界\n",
    "> $$\n",
    "L(q, \\theta) = E_q[\\log p(x, z | \\theta)] - E_q[\\log q(z)] \\tag{20.39}\n",
    "$$\n",
    "> 通过迭代，分别以$q$和$\\theta$为变量时对证据下界进行最大化，就得到变分EM算法。 \n",
    "> \n",
    "> **算法20.3（变分EM算法）**    \n",
    "> 循环执行以下E步和M步，直到收敛。  \n",
    "> （1）E步：固定$\\theta$，求$L(q, \\theta)$对$q$的最大化。  \n",
    "> （2）M步：固定$q$，求$L(q, \\theta)$对$\\theta$的最大化。  \n",
    "给出模型参数$\\theta$的估计值。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第2步：EM算法的收敛性**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第9.2节的定理9.1：\n",
    "\n",
    "> &emsp;&emsp;**定理9.1** 设$P(Y | \\theta)$为观测数据的似然函数，$\\theta^{(i)}(i=1,2,\\cdots)$为EM算法得到的参数估计序列，$P(Y | \\theta^{(i)})(i=1,2,\\cdots)$为对应的似然函数序列，则$P(Y | \\theta^{i})$是单调递增的，即\n",
    "> $$\n",
    "P(Y | \\theta^{(i+1)}) \\geqslant P(Y | \\theta^{(i)})\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;根据书中第9.2节的定理9.2：\n",
    "\n",
    "> &emsp;&emsp;**定理9.2** 设$L(\\theta)=\\log P(Y | \\theta)$为观测数据的对数似然函数，$\\theta^{(i)}(i=1,2,\\cdots)$为EM算法得到的参数估计序列，$L(\\theta^{(i)})(i=1,2,\\cdots)$为对应的对数似然函数序列。  \n",
    ">  &emsp;&emsp;（1）如果$P(Y | \\theta)$有上界，则$L(\\theta^{(i)})= \\log P(Y | \\theta^{(i)})$收敛到某一值$L^*$；  \n",
    ">  &emsp;&emsp;（2）在函数$Q(\\theta, \\theta')$与$L(\\theta)$满足一定条件下，由EM算法得到的参数估计序列$\\theta^{(i)}$的收敛值$\\theta^*$是$L(\\theta)$的稳定点。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**第3步：证明在变分EM算法中，证据下界随着迭代进行单调递增**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;对于证据下界\n",
    "\n",
    "$$\n",
    "L(q,\\theta) = E_q[\\log p(x,z|\\theta)] - E_q[\\log q(z)] \\tag{1}\n",
    "$$\n",
    "\n",
    "&emsp;&emsp;假设迭代次数从$t$到$t-1$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. E步：固定$\\theta$，求$L(q,\\theta^{(t-1)})$对$q$的最大化，由于 $\\log p(x|\\theta)$并不依赖于$q(z)$，所以 $L(q^{(t)}, \\theta^{(t-1)})$的最大值出现在KL散度等于0的时候，即$q(z)$与后验概率分布$q(z|x,\\theta^{(t-1)})$相等的时候，此时有证据下界等于对数似然函数，即\n",
    "\n",
    "$$\n",
    "\\log p(x|\\theta^{(t-1)}) = L(q^{(t)}, \\theta^{(t-1)}) \\tag{2}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;如下图所示\n",
    "\n",
    "<center>\n",
    "<img src=\"./images/20-5-EM-E step.png\" style=\"zoom: 55%;\">\n",
    "<br/>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2. M步：固定$q(z)$求 $L(q,\\theta)$对$\\theta$的最大化，更新模型参数$\\theta$，此时下界$L(q,\\theta)$ 增⼤，有\n",
    "\n",
    "$$\n",
    "L(q^{(t)}, \\theta^{(t-1)}) \\leqslant L(q^{(t)}, \\theta^{(t)}) \\tag{3}\n",
    "$$\n",
    "\n",
    "&emsp;&emsp;由于 $q$ 由模型参数 $\\theta^{(t-1)}$ 确定，并且在 M 步中固定，所以它不会等于更新新的后验概率分布 $p(z|x,\\theta^{(t)})$， 因此 KL 散度⾮零，对数似然函数的增量⼤于证据下界的增量，即\n",
    "\n",
    "$$\n",
    "L(q^{(t)}, \\theta^{(t)})\\leqslant \\log p(x|\\theta^{(t)}) \\tag 4\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;如下图所示\n",
    "<center>\n",
    "<img src=\"./images/20-5-EM-M step.png\" style=\"zoom: 55%;\">\n",
    "<br/>\n",
    "</center>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&emsp;&emsp;结合公式(2)、(3)、(4)，可得\n",
    "$$\n",
    "\\log p(x|\\theta^{(t-1)}) \\leqslant \\log p(x|\\theta^{(t)})\n",
    "$$\n",
    "&emsp;&emsp;即证得在变分EM算法中，$\\log p(x|\\theta)$每次迭代都保证单调递增，根据第2步的定理9.1和定理9.2，因此最后一定收敛。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.5"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "273.188px"
   },
   "toc_section_display": true,
   "toc_window_display": false
  },
  "vscode": {
   "interpreter": {
    "hash": "eb16a45d3969af144375f8a71e40c490388b4fb3c27d50b5e869e018dcf9dd9d"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
